Part 4. Regression (part a)
It’s great if your estimator is as simple as a sample mean \(\overline{X}\) or a difference in sample means \(\overline{Y}_1 - \overline{Y}_0\).
But \(>95\%\) of the time, researchers in social science use OLS regression.
To understand it we’ll use (almost) everything we have learned.
Plug-in estimator: to get \(\hat{\alpha}\) and \(\hat{\beta}\), plug the sample covariance, sample variance, and sample means into the solution above
Given a prediction \(\hat{Y}_i\) for an observation \(i\), the residual is
\[e_i = Y_i - \hat{Y}_i\]
If the prediction \(\hat{Y}_i\) is based on a true population predictor (e.g. the CEF or BLP), then \(e_i = Y_i - \hat{Y}_i\) is called an error.
(Residual : error :: Sample mean \(\overline{X}\) : expectation \({\textrm E}[X]\))
Plug-in estimator (Ordinary Least Squares, OLS):
\[(\hat{\beta_0}, \hat{\beta_1}, \ldots, \hat{\beta_K}) = \underset{(b_0, b_1, \ldots, b_K) \in \mathbb{R}^{K+1}}{\arg\min} \, \frac{1}{n}\sum_{i =1}^n\, \left(Y_i - (b_0 + b_1X_{[1]i} + \ldots + b_K X_{[K]i})\right)^2,\] i.e. choose coefficients to minimize the mean of the squared residuals in the sample
Our problem is \[(\hat{\beta_0}, \hat{\beta_1}, \ldots, \hat{\beta_K}) = \underset{(b_0, b_1, \ldots, b_K) \in \mathbb{R}^{K+1}}{\arg\min} \, \frac{1}{n}\sum_{i =1}^n\,\left(Y_i - (b_0 + b_1X_{[1]i} + \ldots + b_K X_{[K]i})\right)^2,\]
We could try all possible vectors of coefficients to find \((\hat{\beta_0}, \hat{\beta_1}, \ldots, \hat{\beta_K})\).
The solution (via linear algebra and calculus) is \[\hat{\boldsymbol{\beta}} = \left( \mathbb{X}^T \mathbb{X} \right)^{-1} \mathbb{X}^T \mathbf{Y}\]
where \(\mathbb{X}\) is an \(n \times (K+1)\) matrix of predictors and \(\mathbf{Y}\) is an \(n \times 1\) column vector (the dependent variable).
We’ll discuss more.
For now we use lm() or estimatr::lm_robust() in R.
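As a sanity check on the closed-form solution, we can compare the matrix formula to lm() on simulated data (the data-generating process below is invented for illustration):

```r
# Simulated data: 2 predictors plus an intercept (all numbers made up)
set.seed(1)
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                         # n x (K+1) design matrix
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y  # (X'X)^{-1} X'Y

fit <- lm(y ~ x1 + x2)
# The two sets of coefficients agree up to numerical precision
all.equal(as.numeric(beta_hat), as.numeric(coef(fit)))
```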
Using CCES data,
Regressing env (priority for jobs over environment, 1-5) on aa (opposition to affirmative action, 1-4) and female (0-1):
Call:
lm(formula = env ~ aa + female, data = dat)
Coefficients:
(Intercept) aa female
1.10335 0.68121 -0.00335
The result is an estimate of a BLP of the form \[\text{Env}_i = \beta_0 + \beta_1 \text{AffAct}_i + \beta_2 \text{Female}_i\]
Call:
lm(formula = env ~ aa + female, data = dat)
Residuals:
Min 1Q Median 3Q Max
-2.8282 -0.8248 0.1718 0.8564 3.2188
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.10335 0.03735 29.545 <2e-16 ***
aa 0.68121 0.01118 60.930 <2e-16 ***
female -0.00335 0.02319 -0.145 0.885
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.095 on 9179 degrees of freedom
Multiple R-squared: 0.2905, Adjusted R-squared: 0.2903
F-statistic: 1879 on 2 and 9179 DF, p-value: < 2.2e-16
my_reg   <- lm(env ~ aa + female, data = dat)
my_reg_0 <- lm(env ~ aa, data = dat)
my_reg_1 <- lm(env ~ female, data = dat)
modelsummary::modelsummary(list(my_reg_0, my_reg_1, my_reg),
                           stars = TRUE,
                           gof_map = c("r.squared", "nobs"))

| | (1) | (2) | (3) |
|---|---|---|---|
| (Intercept) | 1.101*** | 3.181*** | 1.103*** |
| | (0.035) | (0.018) | (0.037) |
| aa | 0.681*** | | 0.681*** |
| | (0.011) | | (0.011) |
| female | | -0.156*** | -0.003 |
| | | (0.027) | (0.023) |
| R2 | 0.290 | 0.004 | 0.291 |
| Num.Obs. | 9182 | 9182 | 9182 |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
[1] "coefficients" "residuals" "effects" "rank" "fitted.values" "assign" "qr" "df.residual" "xlevels" "call" "terms" "model"
(Intercept) aa female
1.103350634 0.681206314 -0.003350409
1 2 3 4 5 6
0.1718241 -0.7812065 1.1751745 1.2187935 -1.1469696 -1.4657633
[1] "call" "terms" "residuals" "coefficients" "aliased" "sigma" "df" "r.squared" "adj.r.squared" "fstatistic" "cov.unscaled"
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.103350634 0.03734519 29.5446489 2.425620e-183
aa 0.681206314 0.01118021 60.9296538 0.000000e+00
female -0.003350409 0.02318507 -0.1445072 8.851031e-01
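The outputs above come from extracting pieces of the fitted model object and its summary. A minimal sketch of those accessors on a built-in dataset (mtcars here is a stand-in, not the CCES data):

```r
# mtcars is a built-in stand-in dataset; the CCES data are not used here
fit <- lm(mpg ~ wt + hp, data = mtcars)

names(fit)                  # components of the lm object ("coefficients", "residuals", ...)
coef(fit)                   # estimated coefficients
head(residuals(fit))        # residuals e_i = Y_i - Yhat_i
summary(fit)$coefficients   # estimates, SEs, t values, p values
```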
The \(R^2\) measures how predictive the predictors are (in-sample):
Let \(e_i = Y_i - \hat{Y}_i\) be the residual for \(i\) from the model.
Let \(\varepsilon_i = Y_i - \overline{Y}\) be the residual for \(i\) from a null model – no predictors.
“By what proportion does the mean squared residual go down (compared to null model) when we include our predictors?”
\[\begin{align}R^2 &= \frac{\frac{1}{n} \sum_{i = 1}^n \varepsilon_i^2 - \frac{1}{n} \sum_{i = 1}^n e_i^2}{\frac{1}{n} \sum_{i = 1}^n \varepsilon_i^2} \\ &= 1 - \frac{\sum_{i = 1}^n e_i^2}{\sum_{i = 1}^n \varepsilon_i^2}\end{align}\]
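This formula is easy to verify by hand against R's reported \(R^2\) (again using mtcars as a stand-in dataset):

```r
fit <- lm(mpg ~ wt, data = mtcars)
e   <- residuals(fit)                  # residuals from the model
eps <- mtcars$mpg - mean(mtcars$mpg)   # residuals from the null model (no predictors)
r2  <- 1 - sum(e^2) / sum(eps^2)
all.equal(r2, summary(fit)$r.squared)  # matches R's reported R^2
```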
There is a CEF but we’re not claiming to be able to find it. (cf Gauss-Markov theorem)
Instead, trying to estimate a BLP, whose form we choose by deciding what \(X_{[1]}, X_{[2]}, \ldots, X_{[K]}\) are (what variables, what transformations, what interactions).
The BLP is the “best” approximation to \(Y\) and the CEF given the chosen predictors, and OLS is a consistent estimator of it, but the BLP might not be helpful!
So how do we choose a BLP, i.e. a specification?
When does the BLP tell us something useful?
In the simple case with continuous \(X\) and \(Y\), our BLP was a line: \(\alpha + \beta X\).
The BLP is linear in the sense that it is the weighted sum of components \(X_{[1]}, X_{[2]}, \ldots, X_{[K]}\):
\[\beta_0 + \beta_1 X_{[1]} + \beta_2 X_{[2]} + \ldots + \beta_K X_{[K]}\]
But the BLP does not need to be a linear function of any particular variable, e.g. GDPPC or education. For example:
Butler & Broockman conducted an audit experiment where they emailed about 6000 US state legislators asking for help with registering to vote.
Key variables:
- treat_deshawn (1 if email from “Deshawn Jackson”, 0 if email from “Jake Mueller”)
- reply_atall (1 if legislator responded, 0 otherwise)
- leg_R (1 if legislator Republican, 0 otherwise)
Call:
lm(formula = reply_atall ~ treat_deshawn, data = bb)
Coefficients:
(Intercept) treat_deshawn
0.57425 -0.01782
What is the reply rate to emails from “Jake Mueller”? What about emails from “Deshawn Jackson”?
With a single binary \(X\), the BLP \(\alpha + \beta X\) is also the CEF: the CEF is just two conditional expectations, and a line can pass through both exactly.
We can make R give us the two sample means in various ways:
(Intercept) treat_deshawn
0.57424928 -0.01782424
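For instance, with simulated data standing in for the experiment (the treatment and reply variables below are invented), the group means computed directly match the regression coefficients:

```r
# Invented stand-in for the Butler & Broockman data
set.seed(2)
treat <- rbinom(1000, 1, 0.5)
reply <- rbinom(1000, 1, 0.57 - 0.02 * treat)

means <- tapply(reply, treat, mean)  # sample mean of reply within each group
fit   <- lm(reply ~ treat)

all.equal(unname(means["0"]), unname(coef(fit)[1]))   # intercept = control-group mean
all.equal(unname(means["1"]), unname(sum(coef(fit)))) # intercept + slope = treated-group mean
```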
R omits one category; each other category gets a coefficient indicating its difference from the omitted category. Recall that leg_R indicates whether the email recipient was a Republican.
(Intercept) treat_deshawn leg_R
0.54662798 -0.01775660 0.06177313
If we write the prediction equation as \(\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 \mathrm{DeShawn}_i + \hat{\beta}_2 \mathrm{LegRep}_i,\) then we have
| Recipient | Predicted response rate | |
|---|---|---|
| Jake | Dem | \(\hat{\beta}_0\) |
| DeShawn | Dem | \(\hat{\beta}_0 + \hat{\beta}_1\) |
| Jake | Rep | \(\hat{\beta}_0 + \hat{\beta}_2\) |
| DeShawn | Rep | \(\hat{\beta}_0 + \hat{\beta}_1 + \hat{\beta}_2\) |
So what is the predicted DeShawn-vs-Jake difference for Democrats? For Republicans?
To make predictions match actual means (and allow effect of treatment to vary by party of legislator): add an interaction
We start with
\[Y_i = a_0 + a_1 X_i + a_2 D_i\]
Let’s let \(a_2\) depend (linearly) on the value of \(X_i\)!
\[a_2 = b_0 + b_1 X_i \]
Substitute in \[Y_i = a_0 + a_1 X_i + (b_0 + b_1 X_i) D_i \] and rearrange \[Y_i = a_0 + a_1 X_i + b_0 D_i + b_1 X_i D_i\]
(Same result if we make \(a_1\) depend on \(D_i\).)
(Intercept) treat_deshawn leg_R treat_deshawn:leg_R
0.52976190 0.01596300 0.09949293 -0.07550407
If we write the prediction equation as \[\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 \mathrm{DeShawn}_i + \hat{\beta}_2 \mathrm{LegRep}_i + \hat{\beta}_3 \mathrm{DeShawn}_i \times \mathrm{LegRep}_i,\] then we have
| Recipient | Predicted response rate | |
|---|---|---|
| Jake | Dem | \(\hat{\beta}_0\) |
| DeShawn | Dem | \(\hat{\beta}_0 + \hat{\beta}_1\) |
| Jake | Rep | \(\hat{\beta}_0 + \hat{\beta}_2\) |
| DeShawn | Rep | \(\hat{\beta}_0 + \hat{\beta}_1 + \hat{\beta}_2 + \hat{\beta}_3\) |
You can arrange the regression to tell you means more directly:
bb$email_party <- paste(ifelse(bb$treat_deshawn == 1, "Deshawn", "Jake"),
ifelse(bb$leg_R == 0, "D", "R"),
sep = "_")
table(bb$email_party)
Deshawn_D Deshawn_R Jake_D Jake_R
1345 1083 1344 1087
(Intercept) email_partyDeshawn_R email_partyJake_D email_partyJake_R
0.54572491 0.02398885 -0.01596300 0.08352992
email_partyDeshawn_D email_partyDeshawn_R email_partyJake_D email_partyJake_R
0.5457249 0.5697138 0.5297619 0.6292548
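One way to get such group means directly from a regression is to drop the intercept, so every category gets its own coefficient. A sketch with invented data (the bb data aren't reproduced here):

```r
# Invented four-category variable standing in for email_party
set.seed(3)
g <- sample(c("Deshawn_D", "Deshawn_R", "Jake_D", "Jake_R"), 1000, replace = TRUE)
y <- rbinom(1000, 1, 0.55)

fit <- lm(y ~ 0 + g)  # no intercept: one coefficient per category
# Each coefficient is exactly the sample mean of y within that category
all.equal(unname(coef(fit)), as.numeric(tapply(y, g, mean)))
```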
Inference procedure for a single coefficient \(\hat{\beta}_1\) is just like inference for a sample mean:

- Classical standard errors (default for lm()) assume regression errors are independent and identically distributed
- Robust standard errors (default for estimatr::lm_robust()) assume regression errors are independent, but may have different variances (heteroskedasticity)
- Clustered standard errors (the clusters option for estimatr::lm_robust()) assume errors may be correlated within defined clusters

Often we are interested in linear combinations of coefficients, e.g. the effect of the Deshawn email for Republican legislators, \(\hat{\beta}_1 + \hat{\beta}_3\).
If we want a CI or \(p\)-value, we need the standard error of a (weighted) sum of coefficients.
Recall variance rule: in general form, \[{\textrm V}[aX + bY + c] = a^2{\textrm V}[X] + b^2{\textrm V}[Y] + 2 a b \text{Cov}[X, Y]\]
Is e.g. \(\text{Cov}[\hat{\beta}_0, \hat{\beta}_1] = 0\) in \(\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{Deshawn}_i\)?
(Intercept) treat_deshawn leg_R treat_deshawn:leg_R
(Intercept) 0.00018 -0.00018 -0.00018 0.00018
treat_deshawn -0.00018 0.00036 0.00018 -0.00036
leg_R -0.00018 0.00018 0.00041 -0.00041
treat_deshawn:leg_R 0.00018 -0.00036 -0.00041 0.00081
Standard error of effect of Deshawn email for Republicans:
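Applying the variance rule with \(a = b = 1\) to the (rounded) entries of the printed variance-covariance matrix gives a sketch of the calculation; with the actual fitted model one would pull these entries from vcov(fit) rather than typing rounded values:

```r
# Entries copied from the printed vcov matrix above (rounded)
v_b1   <- 0.00036   # V[beta1_hat]   (treat_deshawn)
v_b3   <- 0.00081   # V[beta3_hat]   (treat_deshawn:leg_R)
cov_13 <- -0.00036  # Cov[beta1_hat, beta3_hat]

# V[b1 + b3] = V[b1] + V[b3] + 2 Cov[b1, b3]
se <- sqrt(v_b1 + v_b3 + 2 * cov_13)
se  # about 0.021 with these rounded inputs
```

Equivalently, with the fitted model and weight vector w = c(0, 1, 0, 1), `sqrt(t(w) %*% vcov(fit) %*% w)` gives the same quantity without rounding.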
Here is a dataset on U.S. presidential elections from 1948 to 2016:
| Variable Name | Description |
|---|---|
| year | Election year |
| deminc | 1 = incumbent is a Democrat |
| incvote | Percent of the two-party vote for the incumbent party |
| q2gdp | Second-quarter change in real GDP in election year |
| juneapp | Net approval of incumbent president in June of election year |
Call:
lm(formula = incvote ~ juneapp, data = pres)
Coefficients:
(Intercept) juneapp
51.0368 0.1649
Avoid causal language in describing the coefficient on juneapp. Say e.g.
“One percentage point higher presidential approval in June is associated with .16 percentage points higher vote share for the president’s party in the November election”
Call:
lm(formula = incvote ~ poly(juneapp, degree = 3, raw = T), data = pres)
Residuals:
Min 1Q Median 3Q Max
-7.1491 -1.4341 0.3028 1.5728 5.9915
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.097e+01 1.063e+00 47.932 < 2e-16 ***
poly(juneapp, degree = 3, raw = T)1 2.165e-01 7.136e-02 3.034 0.00893 **
poly(juneapp, degree = 3, raw = T)2 3.833e-04 1.560e-03 0.246 0.80955
poly(juneapp, degree = 3, raw = T)3 -2.811e-05 4.131e-05 -0.681 0.50725
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.33 on 14 degrees of freedom
Multiple R-squared: 0.6723, Adjusted R-squared: 0.6021
F-statistic: 9.576 on 3 and 14 DF, p-value: 0.001076
As you add more polynomial terms (more generally, as you add predictors to the model), you reduce the mean squared residual (MSE) in the sample. But you may be increasing the MSE in the population due to overfitting.
Suppose the dots below are the population, and the red line is the CEF:
We will take samples of size \(n\) from the population (size \(N\)) and estimate the BLP with different orders of polynomial \(k\):
\[Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \ldots + \beta_k X_i^k \]
As we increase \(k\), what will happen to mean squared residual in sample (i.e. \(\frac{1}{n} \sum (Y_i - \hat{Y}_i)^2\)) vs in population (i.e. \(\frac{1}{N} \sum (Y_i - \hat{Y}_i)^2\))?
How does this depend on sample size \(n\)?
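A small simulation sketches the answer (the population, CEF, and sample size below are all invented): in-sample MSE can only fall as \(k\) grows, while population MSE eventually rises.

```r
# Invented population; the CEF here is sin(2x)
set.seed(4)
N <- 10000
n <- 50
pop  <- data.frame(x = runif(N, -2, 2))
pop$y <- sin(2 * pop$x) + rnorm(N, sd = 0.5)
samp <- pop[sample(N, n), ]   # one sample of size n

for (k in c(1, 3, 10)) {
  fit <- lm(y ~ poly(x, k), data = samp)
  mse_samp <- mean(residuals(fit)^2)                       # in-sample MSE
  mse_pop  <- mean((pop$y - predict(fit, newdata = pop))^2)  # population MSE
  cat("k =", k, " in-sample MSE:", round(mse_samp, 3),
      " population MSE:", round(mse_pop, 3), "\n")
}
```

Because the polynomial fits are nested, the in-sample MSE is guaranteed to be non-increasing in \(k\); the population MSE is not, and with a small \(n\) a high-order polynomial typically does worse out of sample.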